Review of Spatial transcriptomics technologies
Spatial transcriptomics data processing and storage in R
Finding spatially variable genes
Dimension reduction and clustering
Downloading the GitHub repository from the command line:
Or navigate to the GitHub repository and download the ZIP.
Running the following script will install and load all of the necessary packages.
Many of the most popular technologies for ST can be divided into two groups (Righelli et al. 2022):
Spot-based: Next generation sequencing (NGS) with spatial barcoding
Molecule-based: In-situ imaging of individual molecules
10X Genomics Visium Youtube Video
Colored probes attach to mRNA transcript from a target gene.
Quantify expression by imaging
Key challenge is extending to many genes
MERFISH uses (error robust) combinatorial labeling to increase the number of gene transcripts that can be measured.
Basic idea: assign a \(N\)-bit binary string to each gene. Then \(2^N\) genes can be encoded and measured after \(N\) sequential rounds of smFISH.
Fig 1A of Chen et al. (2015)
| Visium (Spot-based) | Molecule-based | |
|---|---|---|
| Spatial Resolution | 1-10 cells per spot | sub-cellular |
| Number of genes | Whole transcriptome (20,000+) | 100-10,000 (MERFISH) |
| Detection efficiency | ~15% | ~95% |
| Accessibility | Commercially available; raw data FASTQ | Raw images can be large and difficult to analyze |
Figure from (Moses and Pachter 2022)
Figure from (Moses and Pachter 2022)
Figure from (Moses and Pachter 2022)
Once processed, ST data can be represented using two matrices.
Count matrix \(Y\): For each of \(J\) spots record the number of reads from each of \(I\) genes.
Coordinate matrix \(X\): For each of \(J\) spots record the 2-dimensional spatial coordinate.
The SpatialExperiment package provides a container to store the count and coordinate matrix. Righelli et al. (2022)
The 10X genomics website has example ST datasets.
For this workshop, we will download samples from human cerebral cortex:
The data can be downloaded by navigating to the website or directly from the terminal (Unix):
mkdir cortex
cd cortex
curl -O https://cf.10xgenomics.com/samples/spatial-exp/1.1.0/V1_Human_Brain_Section_1/V1_Human_Brain_Section_1_raw_feature_bc_matrix.tar.gz
curl -O https://cf.10xgenomics.com/samples/spatial-exp/1.1.0/V1_Human_Brain_Section_1/V1_Human_Brain_Section_1_spatial.tar.gz
tar xf V1_Human_Brain_Section_1_raw_feature_bc_matrix.tar.gz
tar xf V1_Human_Brain_Section_1_spatial.tar.gzIf downloading the data from the terminal does not work on your computer, the folder can be downloaded from Google Drive. This link will be deleted within a week.
dir <- "./cortex"
### Step 1: Load the count matrix
require(DropletUtils)
fnm <- file.path(dir, "raw_feature_bc_matrix")
sce <- read10xCounts(fnm) #Single cell object
### Step 2: Load the tissue image
img <- readImgData(
path = file.path(dir, "spatial"),
sample_id = "brain")
# Step 3: Read spatial coordinates
fnm <- file.path(dir, "spatial", "tissue_positions_list.csv")
xy <- read.csv(fnm, header = FALSE,
col.names = c(
"barcode", "in_tissue", "array_row", "array_col",
"pxl_row_in_fullres", "pxl_col_in_fullres"))
ix <- match(sce$Barcode, xy$barcode) #Get cells in correct order
xy <- xy[ix,]
### Step 4: Construct feature metadata
rd <- DataFrame(symbol = rowData(sce)$Symbol)
### Step 5: Create the object
spe <- SpatialExperiment(
assays = list(counts = assay(sce)),
rowData = rd,
colData = DataFrame(xy),
spatialCoordsNames = c("pxl_col_in_fullres", "pxl_row_in_fullres"),
imgData = img,
sample_id = "brain")
### Step 6: Save the object
saveRDS(spe, file=file.path(dir,"spe.RDS"))Source: SpatialExperiment Vignette
For additional examples of data in the SpatialExperiment format, a good package is STexampleData :
A spatially variable gene is one whose expression depends on spatial location
In the human cerebral cortex, the genes MOBP, NPY, SNAP25, and PCP4 were previously reported by Weber et al. (2023) to be spatially variable. We will verify this by plotting.
Problem: Instead of gene symbols we are given ENSEMBL ID:
### Convert ENSEMBL ID to gene symbol
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- rownames(spe)
G_list <- getBM(filters="ensembl_gene_id",
attributes= c("ensembl_gene_id","hgnc_symbol"),
values=genes,
mart=mart)
no.match <- which(G_list$hgnc_symbol == "")
G_list[no.match,2] <- G_list[no.match,1]
ix <- match(G_list$ensembl_gene_id, rownames(spe))
rownames(spe)[ix] <- G_list$hgnc_symbol
interesting_genes <- which(rownames(spe) %in% c("MOBP","PCP4", "SNAP25","NPY"))### Normalize count matrix and subset to interesting genes
spe <- logNormCounts(spe)
Y.sub <- as.matrix(logcounts(spe)[interesting_genes,])
## Switch rows and columns and make dataframe
df <- as.data.frame(t(Y.sub))
## Add spatial coordinates
df$x <- spatialCoords(spe)[,1]; df$y <- spatialCoords(spe)[,2]### https://github.com/lmweber/nnSVG-analyses
### Author: Lukas Weber
df <- melt(df, id.vars=c("x","y"))
p <- ggplot(data=df,aes(x=x,y=y,color=value)) +
geom_point(size=0.05) +
coord_fixed() +
scale_color_gradientn(colors=c("gray90", "blue", "black"),
breaks=c(0,3,6)) +
facet_wrap(~variable, nrow=2) +
theme_bw() +
guides(color=guide_colorbar(ticks=FALSE)) +
labs(color="log count") +
theme(strip.text = element_text(face = "italic"),
panel.grid = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank())
pA statistical method can automatically identify spatially variable genes without relying on previous knowledge.
There are several existing methods, but it is still an area of active research.
Fast non-parametric test to see if the distribution of gene expression \(Y_{i \cdot}\) depends on the spatial locations \(X\). More specifically, does the difference in gene expression between cell \(i\) and \(j\) depend on the spatial distance between cell \(i\) and \(j\) ? Zhu, Sun, and Zhou (2021)
interesting_genes <- rownames(res$res_mtest)[order(res$res_mtest$combinedPval)[1:10]]
spe <- logNormCounts(spe)
Y.sub <- as.matrix(logcounts(spe)[interesting_genes,])
## Switch rows and columns and make dataframe
df <- as.data.frame(t(Y.sub))
## Add spatial coordinates
df$x <- spatialCoords(spe)[,1]; df$y <- spatialCoords(spe)[,2]
### Construct plot
df <- melt(df, id.vars=c("x","y"))
p <- ggplot(data=df,aes(x=x,y=y,color=value)) +
geom_point(size=0.05) +
coord_fixed() +
scale_color_gradientn(colors=c("gray90", "blue", "black"),
breaks=c(0,4,8)) +
facet_wrap(~variable, nrow=2) +
theme_bw() +
guides(color=guide_colorbar(ticks=FALSE)) +
labs(color="log count") +
theme(strip.text = element_text(face = "italic"),
panel.grid = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank())
pOur dataset has over \(30,000\) genes. However, many appear to be highly correlated (see the top SVG plot we made earlier).
If we account for the strong correlation, we may be able to summarize the data using a much smaller number of genes, thus reducing the “dimension”.
PCA begins by finding the line that best approximates the data (PC dimension 1). Next, it finds a line perpendicular to PC dimension 1 that best approximates the data (PC dimension 2). Then repeat to find PC dimension 3, 4, etc.
To reduce the data to dimension \(k\), we can project each data point onto the linear space spanned by the first \(k\) PC dimensions.
For example, when \(k = 1\) we project each point onto a line. When \(k = 2\), we project each point onto a plane.
set.seed(323)
data <- data.frame(mvrnorm(n=100, mu=c(0,0), Sigma=matrix(c(1,0.95,0.95,1),nrow=2)))
data$color <- data[,1]
p <- ggplot(data,aes(x=X1,y=X2,color=color)) +
geom_point(size=2) +
scale_color_gradient(low="lightblue", high="firebrick") +
theme_bw() +
guides(color="none") +
xlab("x") + ylab("y")
ppca <- prcomp(data[,c("X1", "X2")])
data$pc1 <- pca$x[,1]
p <- ggplot(data=data,aes(x=pc1,y=0,color=color)) +
geom_point(size=2) +
scale_color_gradient(low="lightblue", high="firebrick") +
theme_bw() +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank()) +
guides(color="none") +
xlab("PC1 Score")
pFor computational speed, it is standard practice to subset to the ~2000 most highly variable genes before running PCA:
What do you notice about the genes that contribute to PC dimension 1?
df <- data.frame(x=spatialCoords(spe)[,1],
y=spatialCoords(spe)[,2],
PC1=reducedDim(spe)[,1],
PC2=reducedDim(spe)[,2],
PC3=reducedDim(spe)[,3])
df <- df |> pivot_longer(cols=c(PC1,PC2,PC3))
p <- ggplot(data=df,aes(x=x,y=y,color=value)) +
geom_point(size=0.5) +
coord_fixed()+
scale_color_gradient2(low="blue",mid="grey", high="red") +
theme_bw()+
facet_wrap(~name,nrow=1)
pThe PCA results show that there are groups of spots that have very similar expression levels. By applying a clustering algorithm we can explicitly define these groups of cells. Many approaches for clustering first build the shared nearest neighbor (SNN) graph of the cells.
By j_ham3 - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=17125894
A marker gene is one that is differentially expressed in a particular cluster. Marker genes are used to better understand the biological function of the identified clusters.
top5 <- matrix(NA, nrow=length(markers), ncol=5)
rownames(top5) <- paste("Cluster", 1:length(markers))
colnames(top5) <- paste("Marker", 1:5)
for(i in 1:length(markers)) {
top5[i,] <- rownames(markers[[i]])[order(markers[[i]]$mean.AUC, decreasing=TRUE)[1:5]]
}
top5 <- as.data.frame(top5)
knitr::kable(top5)| Marker 1 | Marker 2 | Marker 3 | Marker 4 | Marker 5 | |
|---|---|---|---|---|---|
| Cluster 1 | ENSG00000269028 | SPARC | AQP4 | SPARCL1 | FOS |
| Cluster 2 | MBP | PLP1 | MOBP | BCAS1 | CNP |
| Cluster 3 | SNAP25 | SYT1 | NEFL | TUBA1B | STMN2 |
| Cluster 4 | SLC1A2 | MT3 | SLC1A3 | ATP1A2 | ENSG00000269028 |
| Cluster 5 | ENC1 | VSNL1 | NRGN | YWHAH | MEF2C |
| Cluster 6 | MOBP | MBP | PLP1 | BCAS1 | ERMN |
| Cluster 7 | MAP1A | MAP2 | SLC17A7 | PRKCB | NPTX1 |
Per (Maynard et al. 2021),
MOBP is a white matter/oligodendrocyte marker (Cluster 2)
SNAP25 is a gray matter/neuron marker (Cluster 3)
Repeat the analysis of finding SVGs and clusters for the data (of your choice) in the STexampleData package